XArray & CF Introduction

Unidata Sustainable Science Workshop


Overview:

  • Teaching: 25 minutes
  • Exercises: 20 minutes

Questions

  1. What is XArray?
  2. How does XArray fit in with Numpy and Pandas?
  3. What is the CF convention and how do we use it with Xarray?

Objectives

  1. Create a DataArray.
  2. Open netCDF data using XArray.
  3. Subset the data.
  4. Write a CF-compliant netCDF file.

XArray

XArray expands on the capabilities of NumPy arrays, providing streamlined data manipulation. It is similar in that respect to Pandas, but whereas Pandas excels at working with tabular data, XArray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-Java's Common Data Model (CDM).

DataArray

The DataArray is one of the basic building blocks of XArray. It provides a NumPy ndarray-like object that adds two critical pieces of functionality:

  1. Coordinate names and values are stored with the data, making slicing and indexing much more powerful
  2. It has a built-in container for attributes

In [ ]:
# Convention for import to get shortened namespace
import numpy as np
import xarray as xr

In [ ]:
# Create some sample "temperature" data
data = 283 + 5 * np.random.randn(5, 3, 4)
data

Here we create a basic DataArray by passing it just a numpy array of random data. Note that XArray generates some basic dimension names for us.


In [ ]:
temp = xr.DataArray(data)
temp

We can also pass in our own dimension names:


In [ ]:
temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])
temp

This is already an improvement over a NumPy array, because we have names for each of the dimensions (or axes in NumPy parlance). Even better, we can take arrays representing the coordinate values for each of these dimensions and associate them with the data when we create the DataArray.


In [ ]:
# Use pandas to create an array of datetimes
import pandas as pd
times = pd.date_range('2018-01-01', periods=5)
times

In [ ]:
# Sample lon/lats
lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)

When we create the DataArray instance, we pass in the arrays we just created:


In [ ]:
temp = xr.DataArray(data, coords=[times, lats, lons], dims=['time', 'lat', 'lon'])
temp

...and we can also set some attribute metadata:


In [ ]:
temp.attrs['units'] = 'kelvin'
temp.attrs['standard_name'] = 'air_temperature'

temp

Notice what happens if we perform a mathematical operation with the DataArray: the coordinate values persist, but the attributes are lost. This is done because it is very challenging to know if the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.


In [ ]:
# For example, convert Kelvin to Celsius
temp - 273.15
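If you do want attributes carried through arithmetic, xarray provides a `keep_attrs` option. A minimal sketch (note that the `units` attribute would then be stale, since the values are now Celsius, which is exactly why xarray drops attributes by default):

```python
import numpy as np
import xarray as xr

# Recreate the sample temperature DataArray from above
data = 283 + 5 * np.random.randn(5, 3, 4)
temp = xr.DataArray(data, dims=['time', 'lat', 'lon'])
temp.attrs['units'] = 'kelvin'

# keep_attrs=True tells xarray to carry attributes through operations
with xr.set_options(keep_attrs=True):
    temp_c = temp - 273.15

print(temp_c.attrs)  # attributes preserved, but 'units' is now misleading
```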

Selection

We can use the .sel method to select portions of our data based on these coordinate values, rather than using indices (this is similar to the CDM).


In [ ]:
temp.sel(time='2018-01-02')

.sel has the flexibility to also perform nearest neighbor sampling, taking an optional tolerance:


In [ ]:
from datetime import timedelta
temp.sel(time='2018-01-07', method='nearest', tolerance=timedelta(days=2))

Exercise

.interp() works similarly to .sel(). Using .interp(), get an interpolated time series "forecast" for Boulder (40°N, 105°W) or your favorite latitude/longitude location. (Documentation for interp).


In [ ]:
# Your code goes here

Solution


In [ ]:
# %load solutions/interp_solution.py

Slicing with Selection


In [ ]:
temp.sel(time=slice('2018-01-01', '2018-01-03'), lon=slice(-110, -70), lat=slice(25, 45))

.loc

All of these operations can also be done within square brackets on the .loc attribute of the DataArray. This permits a much more NumPy-like syntax, though you lose the ability to specify dimensions by name; instead, the slicing must be done in the correct dimension order.


In [ ]:
# As done above
temp.loc['2018-01-02']

In [ ]:
temp.loc['2018-01-01':'2018-01-03', 25:45, -110:-70]

In [ ]:
# This *doesn't* work however:
#temp.loc[-110:-70, 25:45,'2018-01-01':'2018-01-03']

Opening netCDF data

With its close ties to the netCDF data model, XArray also supports netCDF as a first-class file format. This means it has easy support for opening netCDF datasets, so long as they conform to some of XArray's requirements (such as one-dimensional coordinates).


In [ ]:
# Open sample North American Reanalysis data in netCDF format
ds = xr.open_dataset('../../data/NARR_19930313_0000.nc')
ds

This returns a Dataset object, a container holding one or more DataArrays, which can optionally share coordinates. We can then pull out individual fields:


In [ ]:
ds.isobaric1

or


In [ ]:
ds['isobaric1']

Datasets also support much of the same subsetting operations as DataArray, but will perform the operation on all data:


In [ ]:
ds_1000 = ds.sel(isobaric1=1000.0)
ds_1000

In [ ]:
ds_1000.Temperature_isobaric

Aggregation operations

Not only can you use the named dimensions for manual slicing and indexing of data, but you can also use them to control aggregation operations, like sum or standard deviation:


In [ ]:
u_winds = ds['u-component_of_wind_isobaric']
u_winds.std(dim=['x', 'y'])
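The same named-dimension pattern works for any reduction. A short sketch using synthetic data (a stand-in for the reanalysis fields above), showing how `dim=` replaces NumPy's positional `axis=` argument:

```python
import numpy as np
import xarray as xr

# Synthetic 3-D field with named dimensions
da = xr.DataArray(np.random.randn(5, 3, 4), dims=['time', 'lat', 'lon'])

# Named dimensions replace NumPy's positional axis= argument
time_mean = da.mean(dim='time')        # collapses time -> shape (3, 4)
area_sum = da.sum(dim=['lat', 'lon'])  # one value per time step
print(time_mean.shape, area_sum.shape)
```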

Exercise

Using the sample dataset, calculate the mean temperature profile (temperature as a function of pressure) over Colorado within this dataset. For this exercise, consider the bounds of Colorado to be:

  • x: -182km to 424km
  • y: -1450km to -990km

(37°N to 41°N and 102°W to 109°W projected to Lambert Conformal projection coordinates)

Solution


In [ ]:
# %load solutions/mean_profile.py

Resources

There is much more in the XArray library. To learn more, visit the XArray Documentation

Introduction to Climate and Forecasting Metadata Conventions

In order to better enable reproducible data and research, the Climate and Forecasting (CF) metadata conventions were created to standardize metadata in atmospheric data files. In the remainder of this notebook, we will introduce the CF data model and discuss some netCDF implementation details to consider when deciding how to write data with CF and netCDF. We will cover gridded data in this notebook, with more in-depth examples provided in the full CF notebook. Xarray makes the creation of netCDF files with proper metadata simple and straightforward, so we will use it instead of the netCDF4-python library.

This assumes a basic understanding of netCDF.

Gridded Data

Let's say we're working with some numerical weather forecast model output. Let's walk through the steps necessary to store this data in netCDF, using the Climate and Forecasting metadata conventions to ensure that our data are available to as many tools as possible.

To start, let's assume the following about our data:

  • It corresponds to forecast three dimensional temperature at several times
  • The native coordinate system of the model is on a regular grid that represents the Earth on a Lambert conformal projection.

We'll also go ahead and generate some arrays of data below to get started:


In [ ]:
# Import some useful Python tools
from datetime import datetime, timedelta

# Twelve hours of hourly output starting at 22Z today
start = datetime.utcnow().replace(hour=22, minute=0, second=0, microsecond=0)
times = np.array([start + timedelta(hours=h) for h in range(13)])

# 3km spacing in x and y
x = np.arange(-150, 153, 3)
y = np.arange(-100, 100, 3)

# Standard pressure levels in hPa
press = np.array([1000, 925, 850, 700, 500, 300, 250])

temps = np.random.randn(times.size, press.size, y.size, x.size)

Time coordinates must contain a units attribute with a string value with a form similar to 'seconds since 2019-01-06 12:00:00.00'. 'seconds', 'minutes', 'hours', and 'days' are the most commonly used units for time. Due to the variable length of months and years, they are not recommended.

Before we can write data, we first need to convert our list of Python datetime instances to numeric values. We can use the cftime library to perform this conversion using the unit string defined above.


In [ ]:
from cftime import date2num
time_units = 'hours since {:%Y-%m-%d 00:00}'.format(times[0])
time_vals = date2num(times, time_units)
time_vals

Now we can use these units when attaching the forecast_time coordinate as we build the Dataset below:

Convert arrays into Xarray Dataset


In [ ]:
ds = xr.Dataset({'temperature': (['time', 'z', 'y', 'x'], temps, {'units':'Kelvin'})},
                 coords={'x_dist': (['x'], x, {'units':'km'}),
                         'y_dist': (['y'], y, {'units':'km'}),
                         'pressure': (['z'], press, {'units':'hPa'}),
                         'forecast_time': (['time'], times)
                })
ds

Due to how xarray handles time units, we need to encode the units in the forecast_time coordinate.


In [ ]:
ds.forecast_time.encoding['units'] = time_units

If we look at our data variable, we can see the units printed out, so they were attached properly!


In [ ]:
ds.temperature

We're going to start by adding some global attribute metadata. These are recommendations from the standard (not required), but they're easy to add and help users keep the data straight, so let's go ahead and do it.


In [ ]:
ds.attrs['Conventions'] = 'CF-1.7'
ds.attrs['title'] = 'Forecast model run'
ds.attrs['institution'] = 'Unidata'
ds.attrs['source'] = 'WRF-1.5'
ds.attrs['history'] = str(datetime.utcnow()) + ' Python'
ds.attrs['references'] = ''
ds.attrs['comment'] = ''
ds

We can also add attributes to this variable to define metadata. The CF conventions require a units attribute to be set for all variables that represent a dimensional quantity. The value of this attribute needs to be parsable by the UDUNITS library. Here we have already set it to a value of 'Kelvin'. We also set the standard (optional) attributes of long_name and standard_name. The former contains a longer description of the variable, while the latter comes from a controlled vocabulary in the CF conventions. This allows users of data to understand, in a standard fashion, what a variable represents. If we had missing values, we could also set the missing_value attribute to an appropriate value.

NASA Dataset Interoperability Recommendations:

Section 2.2 - Include Basic CF Attributes

Include where applicable: units, long_name, standard_name, valid_min / valid_max, scale_factor / add_offset and others.


In [ ]:
ds.temperature.attrs['standard_name'] = 'air_temperature'
ds.temperature.attrs['long_name'] = 'Forecast air temperature'
ds.temperature.attrs['missing_value'] = -9999
ds.temperature
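The NASA recommendations above also mention valid_min / valid_max and scale_factor / add_offset. In xarray, the range attributes are ordinary metadata set in .attrs, while packing parameters are typically placed in .encoding so xarray applies them when writing to netCDF. A sketch on a standalone DataArray (the values here are illustrative, not from the dataset above):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(283 + 5 * np.random.randn(3, 4), dims=['y', 'x'],
                  attrs={'units': 'Kelvin'})

# Range attributes are plain metadata and live in .attrs
da.attrs['valid_min'] = 180.0
da.attrs['valid_max'] = 330.0

# Packing parameters go in .encoding; on write, values are stored as
# int16 and unpacked as stored * scale_factor + add_offset
da.encoding['dtype'] = 'int16'
da.encoding['scale_factor'] = 0.01
da.encoding['add_offset'] = 273.15

print(da.attrs['valid_min'], da.encoding['dtype'])
```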

Coordinate variables

To properly orient our data in time and space, we need to go beyond dimensions (which define common sizes and alignment) and include values along these dimensions, which are called "Coordinate Variables". Generally, these are defined by creating a one dimensional variable with the same name as the respective dimension.

To start, we add metadata to the variables containing our x and y coordinate values. These attributes include standard_names, which allow associating the variables with projections (more on this later), as well as an optional axis attribute to make clear what standard direction the coordinate refers to.


In [ ]:
ds.x.attrs['axis'] = 'X' # Optional
ds.x.attrs['standard_name'] = 'projection_x_coordinate'
ds.x.attrs['long_name'] = 'x-coordinate in projected coordinate system'

ds.y.attrs['axis'] = 'Y' # Optional
ds.y.attrs['standard_name'] = 'projection_y_coordinate'
ds.y.attrs['long_name'] = 'y-coordinate in projected coordinate system'

We also define a coordinate variable pressure to reference our data in the vertical dimension. The standard_name of 'air_pressure' is sufficient to identify this coordinate variable as the vertical axis, but let's go ahead and specify the axis as well. We also specify the attribute positive to indicate whether the variable increases when going up or down. In the case of pressure, this is technically optional.


In [ ]:
ds.pressure.attrs['axis'] = 'Z'  # Optional
ds.pressure.attrs['standard_name'] = 'air_pressure'
ds.pressure.attrs['positive'] = 'down'  # Optional

In [ ]:
ds.forecast_time.attrs['axis'] = 'T'  # Optional
ds.forecast_time.attrs['standard_name'] = 'time'  # Optional
ds.forecast_time.attrs['long_name'] = 'time'

Auxiliary Coordinates

Our data are still not CF-compliant because they do not contain latitude and longitude information, which is needed to properly locate the data. To solve this, we need to add variables with latitude and longitude. These are called "auxiliary coordinate variables", not because they are extra, but because they are not simple one-dimensional variables.

Below, we first generate longitude and latitude values from our projected coordinates using the pyproj library.


In [ ]:
from pyproj import Proj
X, Y = np.meshgrid(x, y)
lcc = Proj({'proj':'lcc', 'lon_0':-105, 'lat_0':40, 'a':6371000.,
            'lat_1':25})
lon, lat = lcc(X * 1000, Y * 1000, inverse=True)

Now we can create the needed variables. Both are dimensioned on y and x and are two-dimensional. The longitude variable is identified as actually containing such information by its required units of 'degrees_east', as well as the optional 'longitude' standard_name attribute. The case is the same for latitude, except the units are 'degrees_north' and the standard_name is 'latitude'.


In [ ]:
ds = ds.assign_coords(lon = (['y', 'x'], lon))
ds = ds.assign_coords(lat = (['y', 'x'], lat))
ds

In [ ]:
ds.lon.attrs['units'] = 'degrees_east'
ds.lon.attrs['standard_name'] = 'longitude'  # Optional
ds.lon.attrs['long_name'] = 'longitude'

ds.lat.attrs['units'] = 'degrees_north'
ds.lat.attrs['standard_name'] = 'latitude'  # Optional
ds.lat.attrs['long_name'] = 'latitude'

With the variables created and assigned as coordinates, these variables are identified as containing coordinates for the temperature variable via the coordinates attribute, a space-separated list of the auxiliary coordinate variable names, which xarray writes for us automatically:


In [ ]:
ds

Coordinate System Information

With our data specified on a Lambert conformal projected grid, it would be good to include this information in our metadata. We can do this using a "grid mapping" variable. This uses a dummy scalar variable as a namespace for holding all of the required information. Relevant variables then reference the dummy variable with their grid_mapping attribute.

Below we create a variable and set it up for a Lambert conformal conic projection on a spherical earth. The grid_mapping_name attribute describes which of the CF-supported grid mappings we are specifying. The names of additional attributes vary between the mappings.


In [ ]:
# Dummy scalar variable to serve as a container for the projection metadata
ds['lambert_projection'] = int()
ds.lambert_projection.attrs['grid_mapping_name'] = 'lambert_conformal_conic'
ds.lambert_projection.attrs['standard_parallel'] = 25.
ds.lambert_projection.attrs['latitude_of_projection_origin'] = 40.
ds.lambert_projection.attrs['longitude_of_central_meridian'] = -105.
ds.lambert_projection.attrs['semi_major_axis'] = 6371000.0
ds.lambert_projection

Now that we created the variable, all that's left is to set the grid_mapping attribute on our Temperature variable to the name of our dummy variable:


In [ ]:
ds.temperature.attrs['grid_mapping'] = 'lambert_projection'  # or proj_var.name
ds

Write to NetCDF

Xarray has built-in support for a few flavors of netCDF. Here we'll write a netCDF4 file from our Dataset.


In [ ]:
ds.to_netcdf('test_netcdf.nc', format='NETCDF4')

In [ ]:
!ncdump test_netcdf.nc
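As a quick sanity check, we can reopen what we wrote and confirm that xarray decodes the CF metadata on the way back in. A minimal self-contained sketch (using a tiny dataset and the hypothetical filename roundtrip_check.nc rather than the full file above):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Write a tiny CF-style dataset, then reopen it
times = pd.date_range('2018-01-01', periods=3)
small = xr.Dataset(
    {'temperature': (['time'], np.array([280.0, 281.0, 282.0]),
                     {'units': 'Kelvin'})},
    coords={'time': times})
small.time.encoding['units'] = 'hours since 2018-01-01 00:00'
small.to_netcdf('roundtrip_check.nc')

# On reopen, xarray decodes the CF time units back into datetime64 values
reopened = xr.open_dataset('roundtrip_check.nc')
print(reopened.time.dtype)
print(reopened.temperature.attrs['units'])
```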
